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ABSTRACT. We produce approximation bounds on a semidefinite programming relaxation for sparse principal 
component analysis. These bounds control approximation ratios for tractable statistics in hypothesis testing 
problems where data points are sampled from Gaussian models with a single sparse leading component. 

(N ■ 

We study approximation bounds for a semidefinite relaxation of the sparse eigenvalue problem, written 
\ here in penalized form 

| max x T Y,x — p Card(a;) 

IM|2 = 1 

in the variable x G W 1 , where £ G S n and p > 0. Sparse eigenvalues appear in many applications in statis- 
tics and machine learning. Sparse eigenvectors are often used, for example, to improve the interpretability of 
principal component analysis, while sparse eigenvalues control recovery thresholds in compressed sensing 
r— T [Candes and Tao, 2007]. Several convex relaxations and greedy algorithms have been developed to find ap- 

u: proximate solutions (see d'Aspremont et al. [2007, 2008], Joumee et al. [2008], Journee et al. [2008] among 

others), but except in simple scenarios where p is small and the two leading eigenvalues of S are separated, 
c | ' very little is known about the tightness of these approximation methods. 

Here, using randomization techniques based on [Ben-Tal and Nemirovski, 2002], we derive simple ap- 
proximation bounds for the semidefinite relaxation derived in [d'Aspremont, Bach, and El Ghaoui, 2008]. 
■ We do not produce a constant approximation ratio and our bounds depend on the optimum value of the semi- 

definite relaxation: the higher this value, the better the approximation. A similar behavior was observed by 
Zwick [1999] for the semidefinite relaxation to MAXCUT, who showed that the classical approximation 
t-h ■ ratio of Goemans and Williamson [1995] can be improved when the value of the cut is high enough. 

We then show that, in some applications, it is possible to bound a priori the optimum value of the semi- 
definite relaxation, hence produce a lower bound on the approximation ratio. In particular, following recent 
works by [Amini and Wainwright, 2009, Berthet and Rigollet, 2012], we focus on the problem of detecting 
the presence of a (significant) sparse principal component in a Gaussian model, hence test the significance of 
£NJ ■ eigenvalues isolated by sparse principal component analysis. More precisely, we apply our approximation 

results to the problem of discriminating between the two Gaussian models 

AT(0,I n ) and AT (0,I n + 6vv T ) 

' 

where v € M n is a sparse vector with unit Euclidean norm and cardinality k. We use a convex relaxation for 
the sparse eigenvalue problem to produce a tractable statistic for this hypothesis testing problem and show 
that in a high-dimensional setting where the dimension n, the number of samples m and the cardinality k 
grow towards infinity proportionally, the detection threshold on 9 remains finite. 

More broadly speaking, in the spirit of smoothed analysis [Spielman and Teng, 2001], this shows that 
analyzing the performance of semidefinite relaxations on random problem instances is sometimes easier and 
provides a somewhat more realistic description of typical approximation ratios. Another classical example 
of this phenomenon is a MAXCUT-like problem arising in statistical physics, for which explicit (asymptotic) 
formulas can be derived for certain random instances, e.g. the Parisi formula [Mezard et al., 1987, Mezard 
and Montanari, 2009, Talagrand, 2010] for computing the ground state of spin glasses in the Sherrington- 
Kirkpatrick model. It thus seems that comparing the performance of convex relaxations on random problem 
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instances (e.g. in detection problems) often yields a more nuanced understanding of their performance in 
cases where uniform approximation ratios are hard to derive or analyze. 

The paper is organized as follows. The next section recalls a few definitions on sparse eigenvalue prob- 
lems. Section 2 recalls the construction of the semidefinite relaxation in [d'Aspremont et al., 2008]. Sec- 
tion 3 derives approximation bounds on the solution of this relaxation. Section 4 studies the performance 
of these approximation bounds on the sparse eigenvector detection problem. Section 5 presents some algo- 
rithms for solving the semidefinite relaxation used as a test statistic in the detection problem. Finally, we 
present some numerical results in Section 6. 



1. Sparse eigenvalues 

We begin by formally defining sparse eigenvalues. Let E S S„ be a symmetric matrix. We define the 
sparse maximum eigenvalues of the matrix E as 

A max (E) = max. x T T>x 

s.t. Card(x) < k (1) 

1Mb = 1, 

in the variable x G M n where the parameter k > controls the sparsity of the solution. We can similarly 
define sparse minimum eigenvalues as 

A min (E) = min. x T Ex 

s.t. Card(x) < k (2) 
1Mb = 1, 

in the variable x G M n . Because A max (E + al) is affine in a, we have 

^min( S ) = ^max(E) - A max (A max (E)I - E) 

and the following sections will be focused on approximating A max (E). 



2. Semidefinite relaxation 

Here, we first recall the semidefinite relaxation for (1) derived in [d'Aspremont, Bach, and El Ghaoui, 
2008]. We assume without loss of generality that E G S n is positive semidefinite (we can always add a 
multiple of the identity) and that the n variables are ordered by decreasing marginal variances, i.e. that 
> • • • > We also assume that we are given a square root A of the matrix E with 

E = A T A, 

where A G IR nxn and we denote by a\, . . . , a n G M. n the columns of A. Note that the problem and our 
algorithms are invariant by permutations of E and by the choice of square root A. In practice, we are very 
often given the data matrix A instead of the co variance E. As we will see below, we can directly exclude 
variables for which Ejj < p, hence we can assume w.l.o.g. that 

< p < min Ejj. 

ie[l,n] 

If this condition is not satisfied, the variable i will never be part of the optimal support and we can focus on 
the reduced problem. 
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2.1. Relaxation bounds on sparse eigenvalues. We can rewrite the maximum eigenvalue problem in terms 
of the data matrix A. We start by writing 

A^L = max x T Sx 

maxV ' Cnrd(a:)<k 

|M|2 = 1 

= max A max VVajof 

Ti°' i} ; vfe / 

1 J u=k 

n 



max max y^Mjfa^xV 

ne{0,l} n |W| 8 =1 
l T u=fc i_i 
n 

max max >^ Uj(aTx) 2 , 
||:s|[ 2 =l «e{o,i}« f-^ 



and use the fact that 



for any 6 G R ra , to write 



max > Uj6j = min < > (6j — p)+ + pfc 
uSfOJl" f-f p>o ^ v 



max max T^UjfaJ'x) 2 
di 2 =l «e{o,i}" f-f * 

r n 

max mm < ((afx) 2 — p) + + pk 



. i=l 



max min ^ V (X 1 / 2 (a i af - P I)X l l 2 j + pfe L 

Rank(X)=l p>0 f-'V / + 

XfcO,Tr(X)=l U_1 ' 

where the last equality follows from the fact, when Rank(X) = 1 the only potentially nonnegative eigen- 
value of (X 1 / 2 (ajaf — pl)X l l 2 \ is (afx) 2 — p, and X 1 / 2 = X = xx T when \\x\\2 = 1. We then produce 
a semidefinite relaxation for ( 1 ) by simply dropping the rank constraint to get the following bound 



Amax(S) < min < 
p>0 



max 

Tr(X) 



n 

i Y,(x 1/2 (Wi-pVX 1/2 ) + + Pk 



(3) 



which is equivalent to a semidefinite program. Note that because Rank(X) = 1 defines a non-convex 
set, we cannot simply switch the min and the max and this last inequality is potentially strict, i.e. the 
semidefinite relaxation only produces an upper bound on A^ ax (S). 

2.2. Penalized problem. We now focus on the inner optimization problem in (3). Starting from a penalized 
version of problem (1), written 

4>(p) = max x T Y,x — p Card(x) (4) 

IM|2 = 1 



it was shown in d'Aspremont et al. [2008] that 

n 

Hp) = max V((afx) 2 -p) 

N|2 = l ^ 



71 

max V TV ( X 1/2 ( ai aJ - pI)X 1/2 ) 



Rank(X) . 
XhO,Tr(X)=l % ~ l 



and we write ip(p) the semidefinite relaxation of this last problem 

Hp) = max. Zti Tr{XV*aiaTxV 2 - pXy 
s.t. Tr(X) = 1, X y 0, 



(5) 



which is equivalent to a semidefinite program [d'Aspremont et al., 2008] and is the inner problem in (3). In 
the next section, we use this quantity as a test statistic for detecting significant sparse eigenvectors. 

3. Approximation Bounds 

Using the randomization argument detailed in [Ben-Tal and Nemirovski, 2002, El Ghaoui, 2006], we can 
derive an explicit bound on the quality of the semidefinite relaxation (5). 

Proposition 3.1. Let us call X the optimal solution to problem (5) and let r = Rank(X), we have 



where 



controls the approximation ratio. 



np ■!?,, 
Mx) = E 



Hp) 



np 



< <P(p) < Hp), 



(6) 



(7) 



Proof. We can assume w.l.o.g. that < p < min^n n ] Y^u. This means that Bi{X) = X 1 ^ 2 (aiaf — 

pl)X 1 / 2 has rank r and exactly one positive eigenvalue aj, with = TV Bi(X) + for i = 1, . . . ,n. 
We also denote by — /3j for j = 2, . . . , k, the (k — 1) negative eigenvalues of Bi{X). We follow the 
randomization procedure in [Ben-Tal and Nemirovski, 2002, El Ghaoui, 2006] and let £ denote normally 
distributed variables on W 1 , we have, using the rotational invariance of the normal distribution 



E 



(^(X)£). 



E 



for i = 1. 



, n. 



We then get 

r 

% = Tr(B(X))+ - Tr(B(X)) = a t - (afXa t - p) < p 
because A max (-E>i(^Q) < aj Xai, hence 



E l(eBi{X)i)- 



> min < E 



>:'i 2 % 



■ E-= 2 /3j<P,/3j>o} 



E 



a 



c2 p_ V r f.2 

SI r I ■ 2 \/ 



by convexity and symmetiy. By homogeneity and convexity, with tp(p) = Y^i=\ a «> we tnen § et 



E 



Y,(eB t (x)o- 



i=i 



> 



1=1 
> E 



a 



.£2 P_ V r F 2 



and we define i!} r (x) as in(7) above. Having shown 
and using E[£ T X£] = Tr(X) = 1, we get 

E[£?=i(e T £ { .poo + ] >u P a(^) next], 



and this bound implies that there exists a nonzero £ such that 



1=1 



Suppose we set 



we now have 



i if (eBi(x)ti) >o 

otherwise, 



which is also, with z = X 1 ^ a nd B l {X) = XBiX 

/ n \ 

fi-Bi J z > np i? 



z T ( , T^lz>n«^(^) z T z. 



vi=l / 



This finally means that for our choice of v, with B{ = a%a\ — p 

4>(p) = max A max ( Y] Uiaiaf | - p Card(u) > A max ( Y] Uj-B, ) > np $ r ftiEl\ 

hence our lower bound n/3 ■d r {ip{p) jnp) < 4>{p) (which holds whenever X is feasible in (5), i.e. whenever 
X y with Tr X = 1). Furthermore, if X is an optimal solution of the relaxation in (5), we also get an 
upper bound on (j)(p), with 

which is the desired result. 



V np J 



An explicit formula involving trigonometric integrals was derived in [El Ghaoui, 2006] (note that our 
definition for the function is slightly different here). When r is large, we can approximate # r (-) by the 
function 

^) = e[(^ 2 -1) + ] (8) 
where £ ~ A/"(0, 1). Indeed, the central limit theorem shows that 

^£fe 2 -l)^AA(0,l), 
r — 1 ^-^ 

when r grows to infinity. By convexity, we also have < d r (x). The function -#(•) can be computed 
explicitly, with 

$(x) = E \{xi 2 - l) 



2 / , xu 2 - du-2J\f l-x~2 

J x -h V2^ v 

2e~ l l 2x / 



V2 



7TX 



+ 2(x - l)M [-X 



2 



where M(-) is the Gaussian cumulative density function. As with all results based on the central limit 
theorem, the approximation starts to be very good at relatively low values of r. The fact that $'(0 + ) = 
means that we cannot obtain a constant approximation ratio (a la MAXCUT). However, because is 
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convex and increasing, we can still derive meaningful lower bounds in (6) if we can bound ip(p)/np from 
below by a given c > 0, with 



%)< Bp tff^< W) whenc<M. 
c \ np J np 



(9) 



We also observe that i/j(0 + ) = A max (S) > 0, and lim^oo $(x)/x = 1 mean that, when n is fixed, we have 

'd(i/j(p)/np) ^ 

tp(p)/np p^o 

i.e. the approximation ratio converges to one as p goes to zero (and solutions get less sparse). In fact 



d{x)/x = 1 



,-1/2 



X 



_1 + 0(x 3 / 2 ), as x -> oo. 



We illustrate these last points by plotting and $(x)/x on the interval [0, 2] in Figure 1, together with 
the function $ r (x) for r = 5. 



3, 





FIGURE 1. Left: The functions -d{x) (solid black line), and $5(2:) (dotted blue line). Right: 
The functions $(x)/x (solid black line), and $5 (x) / x (dotted blue line). 



Remark also that a naive lower bound on ip(p)/np, hence on the approximation ratio, can be obtained by 
plugging X = I/n in problem (5), which yields 

Hp) > £J=i( s «-p)+ 



np 



n 2 p 



Tr(S)-p 



n 2 p 



because we have assumed that p < min^ Ejj. This shows for example that when 

Tr(£) 



P< 



n 2 + 1 



then the approximation ration is greater than We will see below that this range of values for p is 

actually quite typical in detection problems where the matrix £ is Wishart. 
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4. Detection problems 



T\ (10) 



In this section, we focus on the problem of detecting the presence of a sparse leading component in a 
Gaussian model. It was shown in [Berthet and Rigollet, 2012] that the sparse eigenvalue statistic is minimax 
optimal in this setting. Computing sparse maximum eigenvalues is NP-Hard, but we show here that the re- 
laxation detailed in the previous section achieve detection rates that are a multiple of the minimax optimum, 
in a high-dimensional setting where the ambient dimension n, the number of samples m and the sparsity 
level k all grow towards infinity proportionally. More specifically, we focus on the following hypothesis 
testing problem, where 

Ux : x ~ M (0, I n + Ovv 

where 9 > and v E R n is a sparse vector satisfying Card(w) < k* and \\v\\2 = 1. Given m sample 
variables Xi £f", we let S G S n be the sample covariance matrix, with 

. m 

s = y xiXi . 

i=l 

We will now seek to bound the value of the statistics cft(p) and tp(p) defined in (4) and (5) respectively, under 
the two hypotheses above. 

4.1. The optimal statistic <p(p). We start by the easy part, namely bounding (p(p) from below under %\. 
Proposition 4.1. Given S € S n under we have 



Hp) > i + 9- pk* -2(1 + e)J log( - 1/6) (ii) 

1 m 



with probability 1 — 5. 



Proof. By construction, we have <p(p) > A^* ax (S) - pk*, and [Berthet and Rigollet, 2012, Prop. 4. 1] then 
yields the desired result. ■ 

Using again the results in [Berthet and Rigollet, 2012], we now show an upper bound on the value of the 
statistic (j>(p) under Hq. 

Proposition 4.2. Given S G S n under T-Lq, and assuming p > A/m, where 

A = 41og(9en/A;*) + 41og(l/«J) 

we have 

Ak*p 1 

*<^ 1 + ^ + W^T <12) 

with probability 1 — 25 when 5 is small enough. 

Proof. Under Hq, [Berthet and Rigollet, 2012, Prop. 4.2] shows 

Prob 



A^x(S) - pk > 1 - pk + 4 v / t/^ + At/ 



m 



and writing 1 + u = 1 — pk + Av/y/m + Av 2 /m, with v = y/i, yields 

—\/m+ \Jm(l + u + pk) 



hence 



Prob \\t^)-pk>l + u 
i 



Using the fact that <p(p) = maxj A^ ax (S) — pk and (?) < ( s r) k we then get, using union bounds 

Prob [</>(p) > 1 + u] <^exp (fc log ^ - ^(^(1 + « + pk) " I) 2 ) • 

k=i ^ ' 

We write 

Hog— -{y / (l + u + pk) - l) 2 = Hog— -(V(l + u + P*0 - 1) + ( l °S k * -logfc)fe. 

fc 4 fc* 4 

When a < 1, the function 

(Vl + x - l) 2 - ax 

is convex and reaches its minimum when x = 1/(1 — a) 2 — 1, with value —a 2 / (1 — a). A similar argument 
shows that (log k* — log k)k < k*/e. Setting x = u + pk we get 

au + ap/c — (i/l + it + pk — l) 2 < a 2 / (1 — a) 

Setting a = A/pm above, imposing p > A/m, and 

2 .. . 4k* 

au = a /(l — a) H 

em 

we can ensure 

™ & + (log F - log fc)— - (y/(l + u + pk) - l) 2 ] < 
4 \ra m / 

for all k > 1, hence 

k log ^ - ^(VU + n + pfc) - l) 2 + (log k* - log fc)fc < —k log(l/<5) 
fc* 4 

and 

Prob [0(p) > 1 + u] < 



1-5 

which yields the desired result. ■ 

We now use these last two results to determine the minimum signal level 9 which can be detected using 
the statistic 4>(p). We define the following levels 

^ _ -| i_ / fc*(A+4/e)" , 4fc* , 4 / fc*A 

y m + em + eA\/ m(l+4/(eA)) ^ 

for some 7 > 0. Given £ G S n and G [to, t{\, the corresponding test is given by 

1{4>(p)>t^} (14) 

The following proposition shows that if 9 is high enough, then this test discriminates between Ho and Hi 
with probability 1 — 35. 

Proposition 4.3. Suppose we set 

A = 41og(9en/fc*) + 41og(l/<5) and p = — + — A (15) 

m ^k*m(A + 4/e) 

a«<i define such that 



Vm m V m y \ V m J 

then if 9 > 9$ in the Gaussian model (10), the test statistic (14) based on <fi(p) discriminates between Ho 
and Hi with probability 1 — 35. 



Proof. If 9$ is set as in (16), setting p as in (15) means A/mp < 1, we have to < t\ and propositions 4.2 
and 4.1 show that (14) discriminates between 7~Lq and H%.M 



This detection level was shown to be minimax optimal in [Berthet and Rigollet, 2012]. This is not 
surprising, since the statistic <f>(p) is simply a penalized formulation of A^ iax (-) which was shown to reach a 
similar detection level in [Berthet and Rigollet, 2012]. Both <p(p) and A^ ax (-) are intractable however, and 
we will now focus on an efficiently computable statistic based on ip(p). 

4.2. The tractable statistic i/)(p). We can directly use proposition 4.1 to produce a lower bound on ip{p) 
and on the approximation ratio under both %q and %\. 

Corollary 4.4. Setting A as in (15), under both Hq and Hi, we have 



«p)>l-pr-2^iM> 
V m 

with probability 1 — 5. 

Proof. We simply set 9 = in proposition (4.1) and use the fact that <p(p) < ip(p) by construction. ■ 

We are now ready to prove the main result of this section, showing that in a high dimensional setting, the 
tractable statistic ip(p) discriminates between Ho and Hi when 9 > 9^, where 9^ is comparable to 9$, and 
9^, is independent of n. The approximation ratio in (9) is controlled by np/i/j(p) which depends on p so 
we cannot explicitly minimize the detection level in p as we did above. Instead, we will control the quality 
of the approximation of 0(p) by ip(p) for the value of p used in computing 9$. We suppose n = pm and 
k* = kti, where p > and k G (0, 1). Setting p as in (15), we get 

np = pA H = 

with Corollary 4.4 implying 



^p)>l-pA K *" H nJWW 



V(A + 4/e) 



rn 



This means that the approximation ratio in (9) is bounded below by j3(p, k), with 



i-pAk- ^ -2J l -^m 

(3{p,K) = where c= -j . (17) 

P V«(A+4/e) 

Given S G S n and r^, G k) _1 to, tj], where ro and n ai - e defined in (13), the corresponding test is 
then 

1 W(P)>^} ( 18 ) 
with p set as in (15). The following proposition shows that if 9 is high enough, then this test discriminates 
between T-Lq and Hi with probability 1 — 35. 

Theorem 4.5. suppose n = pm and k* = ktl, where p > and k G (0,1) are fixed and n is large. Define 
the detection threshold 9f such that 

9 f >/3(p 1 K)- 1 9 lj} (19) 

where (3(p, k) is defined in (17) and 9^ is defined in (16), then if 9 > 9^ in the Gaussian model (10) the test 
statistic (18) based on ip(p) discriminates between %q and Hi with probability 1 — 35. 

Proof. Having bounded the approximation ratio j3(p, k) defined in (17), the result follows from (9) and 
Proposition 4.3. ■ 

In Figure 2, we plot the level sets of f3(p, k) for A = 5. Observe that whenever p is small enough, 
P(p, k) > for all values of k G (0, 1) and the approximation ratio converges to one as p goes to zero. 
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This means that the detection threshold 6 of the statistic tp(p) remains finite when n goes to infinity in the 
proportional regime. By contrast, the detection threshold of the MDP statistic in [Berthet and Rigollet, 2012] 
blows up to infinity as soon when k goes to infinity in this scenario. 




Figure 2. Level sets of /3(p, n) for A = 5. 



4.3. Detection thresholds for tjj(p) and A max (-). The fact that ip(0) = A max (S) means the statistic ip(p) 
should always perform better than A max (-) for at least some values of p. However, we notice in Figure 2 that 
(3(/jl, k) goes to zero as k goes to zero, which is a direct consequence of our choice of p in (15). Our choice 
of p is optimal for the statistic 0(p) but not for ip(p) and the main issue here is that we cannot explicitly 
maximize the difference n — (3(p, k)^ 1 tq as a function of p. On the other hand, it is easy to show that a 
better guess for p, when both k and p are small, is to pick 

P = ~, (20) 

n 

in which case, one can show that the detection threshold for 9$ becomes 

while the approximation ratio f3(p, k) is given by #(?/>(l/n)) which is of order one. This means that the 
detection threshold for tp(p) is controlled by 

( 1 + ^) k + T^a"( 1 + ^) k + ^ a 

when both k and pA are small. On the other hand, [Benaych-Georges et al., 201 1] show that, in our regime, 
the statistic A max (-) can only distinguish between T-Lq and T~Li when 9 is larger than 

Vp + p, 

which means that even for our suboptimal choice of p, the statistic ip(p) outperforms A(-) by a factor A^/ju. 

5. Algorithms 

The approximation performance we studied in the previous section comes at a price. While the semi- 
definite program (5) is tractable, its complexity is significantly higher than that of the simpler relaxations 
derived in [d'Aspremont et al., 2007], and very significantly higher than the MDP statistic in [Berthet and 
Rigollet, 2012] for example. d'Aspremont et al. [2008] derived greedy algorithms to produce good upper 
bounds on tp(p) from approximate dual solutions to problem (5), but there are of course no guarantees on the 
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quality of their output. In this section, we describe a simple algorithm to compute ip(p), with comparatively 
low storage and iteration costs. 

Recall from [d'Aspremont et al., 2008] that the dual to problem (5) is written 

minimize A max (Ya=i Y i) 

subject to Yi >z cLiaJ — pi (22) 
Yi >z 0, i = 1, . . . , n 

in the variables Yi G S n , where p > and dj £ M n are defined in Section 2. We first show how to regularize 
this problem, then discuss how to solve the regularized instance using a Frank- Wolfe type algorithm. 

5.1. Smoothing. Problem (22) is not smooth but we can write 

A max (Y) = max Tr{YX) 
Tr(X)=l,XhO 

and as in [Ben-Tal and Nemirovski, 2005], a natural way to regularize A max (-) is to add a (strongly convex) 
matrix entropy penalty to this variational formulation. We also add an explicit lower bound on the eigenval- 
ues of X to ensure that the gradient matrix X is invertible and well conditioned. We summarize this in the 
next lemma. 

Lemma 5.1. Let e > 0, the function 

f(Y)= max Tr(YX) - — — (Tr(X log(X)) + logn) (23) 

Tr(X)=l, logn 
X>{e/n)l 

has a Lipschitz continuous gradient with constant 

log n 

e 

with respect to the trace norm and satisfies 

(1 - e)A max 0n - e < f(Y) < A max (y). 

for all Y £ S n . 

Proof. In the spectahedron setting, we know from [Ben-Tal and Nemirovski, 2005] that the matrix entropy 
function 

d(X) = Tr(Xlog(X)) +logn 

is strongly convex with parameter 1/2 with respect to the trace norm (the dual of the spectral norm) and 
satisfies 

max d(X) < logn. 

Tr(X)=l, J(>0 

Then, [Nesterov, 2005, Thm. 1] shows that Vf(Y) is Lipschitz continuous with constant logn/e with re- 
spect to the trace norm. By construction, we have f(Y) < A max (y) and 

(1 - e)A max (r) - e < max Tr(YX) — e < f(Y) 
Tr(X)=l, 
Xy(e/n)l 

hence the desired result. ■ 

We can thus form a smooth approximation of problem (22), written 

minimize / {YJLi Y i) 

subject to Yi y didf — pi (25) 
Yi >z 0, i = 1, . . . , n 

in the variables Yi £ S n , where / is the smooth approximation of the function A max (-) defined in (23) and 
solve this problem using Algorithm 1. 

n 



5.2. Complexity. We first show how to efficiently compute both f(Z) and Vf(Z). 
Lemma 5.2. Assume Z = V di&g(y)V T and suppose A € M solves 

m > mm < log— , pe > (26) 

» Z — ^ rj n rj 



mm 

A In n n 

i=l v 

Y = V di&g(x)V T solves the maximization problem in (23), where 

( e Hi±^_il 
Xj = max < — , e > , i = l,...,n 
{n J 

ant/ we Ziave V/(Z) = Y. 

Proof. The function /(^) is spectral so solving (23) is equivalent to solving 

n 

max y T x — Xi log Xi 

x T x=l, logn^ 



x>(e/n) 1 



whose dual is 



min max (w — Xl) T x — X{ log x\ — A 

A x>e/n logn ' 

which is equivalent to (26). Now, the fact that f(Z) is a maximum of affine functions of Z shows that 
Vf(Z) = Y. m 

Besides computing the gradient, the main cost at each iteration is the problem of solving the n subprob- 
lems (24). We will see that when Vf(Z) is positive definite, then this can be done in closed form, with 
complexity 0(n 2 log n). Furthermore, the matrices Y{ do not need to be stored, only their sum is required. 

Lemma 5.3. Given X € S n such that X >~ 0, together with Bi = aiaj — pi. for some p > 0, we have 

min Tr(XY) = Tr{X 1 / 2 B i X 1/2 ) + 

and the optimal solution has rank one and is given by 

Y = X- l l 2 vv T X l l 2 Bi, ifp < ||oi||l, 
otherwise, 

where v is the leading eigenvector of the matrix X 1 / 2 BiX 1 / 2 . 

Algorithm 1 Frank- Wolfe algorithm for computing ip(p). 

Input: p > and a feasible starting point Z$. 
l: for k = 1 to N max do 

2: Compute X = V/(Z), together with X~ l and X 1 / 2 . 
3: Solve the n subproblems 

minimize TV(YjX) 

subject to Yi y didf — pi (24) 
Yt h 0, 

in the variables Yi € S„ for i = 1, . . . , n. 
4: Compute W = £? =1 Y i- 
5: Update the current point, with 



Z k = I 1 - -, ) Z k _ x + -z—W, 

l k + 2 J k + 2 



6: end for 
Output: A matrix Z £ S n . 
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Proof. Recall from d'Aspremont et al. [2008] that 

Tr(X 1 / 2 B i X 1 / 2 )+ = max Tr(PBi) 

= min Tr(YX), 

{y^b, y^o} 

and a solution to the dual of problem (24) can be obtained from the solution to 

max Tr{QX l / 2 B i X 1 / 2 ). 

By Sylvester's theorem, when p < ||a^ the matrix X x l 2 BiX x l 2 has exactly one nonnegative eigenvalue, 
so the optimal solution to this last problem is Q = vv T where v is the leading eigenvector of the matrix 
X 1 l 2 B i X 1 / 2 . This means that the optimal dual solution is P = X x l 2 vv T X 1 ! 2 . Finally, the KKT optimality 
conditions impose XY = PBi, which together with X y yields the desired result. ■ 

The last lemma shows that solving problem (24) requires the following steps. Assume X^ 1 and X 1 / 2 
have been precomputed, we first form the matrix X l / 2 BiX 1 / 2 at cost 0(n 2 ). We then compute its leading 
eigenvector at cost 0(n 2 log n) and form the rank one matrix P = X l / 2 vv T X 1 / 2 at cost 0{n 2 ). Because 
P is rank one, computing X~ x PBi also costs 0{n 2 ). This means that the total cost of solving problem (24) 
is bounded by 0(n 2 logn). Furthermore, without loss of generality, we can restrict the matrices Yi to have 
norm less than 



B 

/ j £ 

2 

which means we can assume the feasible set of problem (25) is compact. 



rt 

i=i 



Proposition 5.4. Assume A £ M. nxn and p > 0, Algorithm 1 will produce and e solution to problem (25) in 

D 2 log n N 



O 



e 2 



iterations, where D is the trace norm diameter of the feasible set 

D = diam j^Jl- -.YihB^ih 0, ||1-|| 2 < B j . 

Each iteration has complexity 0(n 3 logn) and storage cost 0(n 2 ). 

Proof. We use the complexity bounds in [Frank and Wolfe, 1956, Clarkson, 2010, Jaggi, 2011] for 
example and Lemma (5. 1) to control the curvature of /. ■ 



6. Numerical Experiments 

We test the detection procedure based on ip(p) described in (18). We generate 3000 experiments, where 
m points xi G W 2 are sampled under both hypotheses, with 

f H : x~AA(0,I n ) 

\ Ui : x ~ M (0, I n + 9vv T ) 

as in (10). In each experiment, we pick the leading dimension n = 100, the number of samples m = 50 and 
the cardinality k = 20. We set 9 = 3, Vi = 1/Vk when i G [l,k] and zero otherwise. In Figure 3 we plot 
the distributions of the test statistic if>(p) defined in (18), the MDP statistic in [Berthet and Rigollet, 2012], 
the A m ax(0 statistic and the diagonal statistic in [Amini and Wainwright, 2009]. As in [Berthet and Rigollet, 
2012], we observe that all the test statistics perform very similarly, except for the diagonal test. 
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8 10 12 14 16 

MDPh 



8 10 12 14 




Amax(S) max(diag(S)) 

FIGURE 3. Distribution of test statistic ip{p) (top left), the MDP k statistic in [Berthet and 
Rigollet, 2012] (top right), the A max (-) statistic (bottom left) and the diagonal statistic from 
[Amini and Wainwright, 2009] (bottom right) under both Hq and %\. All experiments 
are performed on random Gaussian matrices with ambient dimension n = 100, m = 50 
samples and cardinality for v under Hi set to k = 20. 
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